An Efficient and Versatile Query Engine for TopX Search
Abstract
This paper presents a novel engine, coined TopX, for efficient ranked retrieval of XML documents over semistructured but nonschematic data collections. The algorithm follows the paradigm of threshold algorithms for top-k query processing with a focus on inexpensive sequential accesses to index lists and only a few judiciously scheduled random accesses. The difficulties in applying the existing top-k algorithms to XML data lie in 1) the need to consider scores for XML elements while aggregating them at the document level, 2) the combination of vague content conditions with XML path conditions, 3) the need to relax query conditions if too few results satisfy all conditions, and 4) the selectivity estimation for both content and structure conditions and their impact on evaluation strategies. TopX addresses these issues by precomputing score and path information in an appropriately designed index structure, by largely avoiding or postponing the evaluation of expensive path conditions so as to preserve the sequential access pattern on index lists, and by selectively scheduling random accesses when they are cost-beneficial. In addition, TopX can compute approximate top-k results using probabilistic score estimators, thus speeding up queries with a small and controllable loss in retrieval precision.
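To make the threshold-algorithm paradigm mentioned in the abstract concrete, here is a minimal sketch of a generic, textbook-style threshold algorithm over score-sorted index lists. It is not the TopX implementation: the function name `threshold_top_k`, the list layout, and the use of plain summation as the aggregation function are all assumptions chosen for illustration, and it performs a random access for every newly seen document rather than scheduling them selectively as the abstract describes.

```python
import heapq


def threshold_top_k(index_lists, k):
    """Generic threshold-style top-k (illustrative sketch, not TopX itself).

    index_lists: one list per query condition, each holding
                 (doc_id, score) pairs sorted by descending score.
    Returns up to k (total_score, doc_id) pairs, best first.
    """
    # Hypothetical random-access structure: one dict per index list.
    lookup = [dict(lst) for lst in index_lists]
    seen = set()
    top = []  # min-heap holding the current k best (score, doc_id) pairs
    max_depth = max((len(lst) for lst in index_lists), default=0)

    for depth in range(max_depth):
        # Upper bound on the total score of any document not yet seen.
        threshold = 0.0
        for lst in index_lists:
            if depth >= len(lst):
                continue  # exhausted list contributes nothing further
            doc_id, score = lst[depth]  # sequential (sorted) access
            threshold += score
            if doc_id in seen:
                continue
            seen.add(doc_id)
            # Random accesses: fetch this document's score in every list.
            total = sum(table.get(doc_id, 0.0) for table in lookup)
            if len(top) < k:
                heapq.heappush(top, (total, doc_id))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, doc_id))
        # Stop early once no unseen document can beat the k-th best score.
        if len(top) == k and top[0][0] >= threshold:
            break

    return sorted(top, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    lists = [
        [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)],
        [("d2", 0.7), ("d3", 0.6), ("d1", 0.2)],
    ]
    print(threshold_top_k(lists, 1))
```

The early-termination test relies on the aggregation function being monotonic: once the k-th best total is at least the sum of the scores at the current scan positions, no unseen document can enter the result.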
Similar resources
TopX - Efficient and Versatile Top-k Query Processing for Text, Semistructured, and Structured Data
This paper presents a comprehensive overview of the TopX search engine, an extensive framework for unified indexing and querying large collections of unstructured, semistructured, and structured data. Residing at the very synapse of database (DB) engineering and information retrieval (IR), it integrates efficient scheduling algorithms for top-k-style ranked retrieval with powerful scoring model...
TopX: efficient and versatile top-k query processing for text, structured, and semistructured data
TopX is a top-k retrieval engine for text and XML data. Unlike Boolean engines, it stops query processing as soon as it can safely determine the k top-ranked result objects according to a monotonic score aggregation function with respect to a multidimensional query. The main contributions of the thesis unfold into four main points, confirmed by previous publications at international conference...
Incremental Relevance Feedback for TopX submitted by Osama Sammodi
TopX is a highly efficient and effective search engine for ranked retrieval of XML and plain text data. However, for some difficult queries, the results provided by TopX are not yet completely satisfying. Towards the solution of this problem, an extensible framework has been proposed that incorporates feedback from the user to generate a better, expanded query. In this thesis, we integrate the ...
Similarity Measures for Query Expansion in TopX
TopX is a top-k retrieval engine for text and XML data. Unlike some other engines, TopX includes an ontology. This ontology allows TopX to use techniques like word sense disambiguation and query expansion, to search for words similar to the original query terms. These techniques allow finding data items which would be ignored for the original source query, due to missing of words similar to the...
P2P Web Search: Make It Light, Make It Fly (Demo)
We propose a live demonstration of MinervaLight, a P2P Web search engine. MinervaLight combines the (previously separate) focused crawler BINGO! (to harvest Web data), the local search engine TopX, and our P2P Web search system MINERVA under one common user interface. The crawler unattendedly downloads and indexes Web data, where the scope of the focused crawl can be tailored to the thematic in...
Publication date: 2005